18 research outputs found

    Towards Accurate Multi-person Pose Estimation in the Wild

    Full text link
    We propose a method for multi-person detection and 2-D pose estimation that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages. In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum-Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring. Trained on COCO data alone, our final system achieves an average precision of 0.649 on the COCO test-dev set and 0.643 on the test-standard set, outperforming the winner of the 2016 COCO keypoints challenge and other recent state-of-the-art methods. Further, by using additional in-house labeled data we obtain an even higher average precision of 0.685 on the test-dev set and 0.673 on the test-standard set, more than 5% absolute improvement compared to the previous best performing method on the same dataset. Comment: Paper describing an improved version of the G-RMI entry to the 2016 COCO keypoints challenge (http://image-net.org/challenges/ilsvrc+coco2016). Camera ready version to appear in the Proceedings of CVPR 2017.
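    The abstract names the ingredients of the second stage (dense heatmaps, 2-D offsets, an aggregation step) but not the exact formula. As a minimal sketch of how such outputs can be fused, the snippet below lets every pixel cast a heatmap-weighted vote at the location its offset points to and takes the strongest cell; the function name, array shapes, and voting scheme are assumptions for illustration, not the authors' code.

```python
import numpy as np

def localize_keypoint(heatmap, offset_x, offset_y):
    """Fuse one keypoint's dense heatmap with its per-pixel 2-D offsets.

    heatmap:  (H, W) activation indicating the keypoint lies near each pixel
    offset_x: (H, W) predicted x-displacement from each pixel to the keypoint
    offset_y: (H, W) predicted y-displacement from each pixel to the keypoint
    Returns (x, y, score). Illustrative aggregation only.
    """
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Every pixel votes at the location it points to, weighted by its activation.
    vx = np.clip(np.round(xs + offset_x).astype(int), 0, W - 1)
    vy = np.clip(np.round(ys + offset_y).astype(int), 0, H - 1)
    acc = np.zeros((H, W))
    np.add.at(acc, (vy.ravel(), vx.ravel()), heatmap.ravel())
    y, x = np.unravel_index(acc.argmax(), acc.shape)
    return int(x), int(y), float(acc[y, x])
```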

    Spatial Motion Doodles: Sketching Animation in VR Using Hand Gestures and Laban Motion Analysis

    Get PDF
    We present a method for easily drafting expressive character animation by playing with instrumented rigid objects. We parse the input 6D trajectories (position and orientation over time), called spatial motion doodles, into sequences of actions and convert them into detailed character animations using a dataset of parameterized motion clips which are automatically fitted to the doodles in terms of global trajectory and timing. Moreover, we capture the expressiveness of user manipulation by analyzing Laban effort qualities in the input spatial motion doodles and transferring them to the synthetic motions we generate. We validate the ease of use of our system and the expressiveness of the resulting animations through a series of user studies, showing the interest of our approach for interactive digital storytelling applications dedicated to children and non-expert users, as well as for providing fast drafting tools for animators.
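    The abstract does not detail how Laban effort qualities are measured. As a rough sketch of the idea, the snippet below computes simple kinematic statistics of a doodle's positional track as stand-ins for effort qualities; the function name, the choice of statistics, and their mapping to Time/Weight/Flow are assumptions, not the authors' features.

```python
import numpy as np

def effort_features(positions, dt):
    """Crude kinematic proxies for Laban effort qualities on a 3-D trajectory.

    positions: (T, 3) sampled positions of the hand-held object
    dt:        sampling interval in seconds
    Returns a dict of scalar descriptors (illustrative only).
    """
    vel = np.gradient(positions, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    jerk = np.gradient(acc, dt, axis=0)
    speed = np.linalg.norm(vel, axis=1)
    return {
        "time":   float(np.mean(np.linalg.norm(acc, axis=1))),   # sudden vs. sustained
        "weight": float(np.max(speed)),                          # strong vs. light
        "flow":   float(np.mean(np.linalg.norm(jerk, axis=1))),  # bound vs. free
    }
```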

    Recognizing multimodal entailment

    Get PDF
    How information is created, shared and consumed has changed rapidly in recent decades, in part thanks to new social platforms and technologies on the web. With ever-larger amounts of unstructured data and limited labels, organizing and reconciling information from different sources and modalities is a central challenge in machine learning. This cutting-edge tutorial aims to introduce the multimodal entailment task, which can be useful for detecting semantic alignments when a single modality alone does not suffice for understanding the whole content. Starting with a brief overview of natural language processing, computer vision, structured data and neural graph learning, we lay the foundations for the multimodal sections to follow. We then discuss recent multimodal learning literature covering visual, audio and language streams, and explore case studies focusing on tasks which require fine-grained understanding of visual and linguistic semantics: question answering, veracity and hatred classification. Finally, we introduce a new dataset for recognizing multimodal entailment, exploring it in a hands-on collaborative section. Overall, this tutorial gives an overview of multimodal learning, introduces a multimodal entailment dataset, and encourages future research on the topic.

    Paul Debevec's SIGGRAPH 99 course no. 39 on Image-Based Modeling and Rendering: Video-Based Animation Techniques for Human Motion

    No full text
    scenes, or architectural scenes. Explicit geometric structures are combined with image data. Texture mapping and view morphing are simple examples. We can generate new images from a collection of recorded images. Simple geometry dictates coarse transformations of fine-grained image texture. New views of a scene can be generated by blending between the transformed example textures. This is a trade-off between explicit structure (a collection of views and a geometric model) and implicit example data (the image texture). Such trade-offs apply to other domains as well. The most successful speech production systems (text-to-speech, concatenative speech) follow a similar philosophy. A collection of annotated example sounds is used to create new sounds. A sentence is built from phonemes (explicit structure). To blend the phonemes together, the sound examples are pitch and time warped (implicit data). We will show how this extends to video data and human motion animation. Structure vs Data for Animation: So far, most graphical animation techniques do not exploit such trade-offs between explicit structure and implicit data. Many facial and body animations are generated by 3D volumetric models and physical simulations. Some facial animation systems texture map images onto the geometric model, or morph
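    To make the structure-vs-data trade-off concrete, here is a minimal sketch of the image-based step described above: example textures that have already been warped by the coarse geometry into the target view are cross-blended to form a new image. The function and its weighting scheme are illustrative assumptions, not code from the course notes.

```python
import numpy as np

def blend_views(warped_views, weights):
    """Blend example textures that were already warped into the target view.

    warped_views: list of (H, W, 3) float arrays, each example image after the
                  coarse geometric transform dictated by the scene structure
    weights:      per-view blending weights, e.g. based on how close each
                  example viewpoint is to the target viewpoint
    Illustrative of the trade-off only, not a specific rendering system.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalize the blend weights
    out = np.zeros_like(warped_views[0], dtype=float)
    for view, wi in zip(warped_views, w):
        out += wi * view                              # weighted cross-dissolve
    return out
```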

    Performance driven facial animation using blendshape interpolation

    No full text
    This paper describes a method of creating facial animation using a combination of motion capture data and blendshape interpolation. An animator can design a character as usual, but use motion capture data to drive facial animation, rather than animate by hand. The method is effective even when the motion capture actor and the target model have quite different shapes. The process consists of several stages. First, computer vision techniques are used to track the facial features of a talking actress in a video recording. Given the tracking data, our system automatically discovers a compact set of key shapes that model the characteristic motion variations. Next, the facial tracking data is decomposed into a weighted combination of the key shape set. Finally, the user creates corresponding target key shapes for an animated face model. A new facial animation is produced by using the same weights as recovered during facial decomposition, interpolated with the new key shapes created by the user. The resulting facial animation resembles the facial motion in the video recording, while the user has complete control over the appearance of the new face.
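    The decomposition-and-retargeting step lends itself to a short sketch: solve for key-shape weights that reconstruct the tracked frame, then reuse those weights on the artist's target key shapes. The use of non-negative least squares (scipy's nnls) and the matrix layout are assumptions for illustration; the paper may constrain or solve the fit differently.

```python
import numpy as np
from scipy.optimize import nnls

def decompose_and_retarget(tracked, source_keys, target_keys):
    """Decompose a tracked face shape into key-shape weights, then retarget.

    tracked:     (3N,) flattened tracked feature positions for one frame
    source_keys: (3N, K) columns are the K source key shapes
    target_keys: (3M, K) columns are the corresponding artist-made target key shapes
    Returns (weights, retargeted). Illustrative formulation only.
    """
    # Fit: tracked ~= source_keys @ weights, with non-negative weights (assumed).
    weights, _ = nnls(source_keys, tracked)
    # Reuse the recovered weights to interpolate the target key shapes.
    retargeted = target_keys @ weights
    return weights, retargeted
```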

    Finding Pictures of Objects in Large Collections of Images

    Get PDF
    "Retrieving images from very large collections using image content as a key is becoming an important problem. Users prefer to ask for pictures using notions of content that are strongly oriented to the presence of objects, which are quite abstractly defined. Computer programs that implement these queries automatically are desirable but are hard to build be-cause conventional object recognition techniques from computer vision cannot recognize very general objects in very general contexts. This paper describes an approach to object recognition structured around a sequence of increasingly specialized grouping activities that assemble coherent regions of image that can be shown to satisfy increasingly stringent constraints. The constraints that are satisfied provide a form of object classification in quite general contexts. This view of recognition is distinguished by far richer involvement of early visual primitives, including color and texture; the ability to deal with rather general objects in uncontrolled configurations and contexts; and a satisfactory notion of classification. These properties are illustrated with three case studies: one demonstrates the use of descriptions that fuse color and spatial properties; one shows how trees can be de-scribed by fusing texture and geometric properties; and one shows how this view of recognition yields a program that can tell, quite accurately, whether a picture contains naked people or not."published or submitted for publicatio

    COSMOS: Catching Out-of-Context Image Misuse Using Self-Supervised Learning

    No full text
    Despite the recent attention to DeepFakes, one of the most prevalent ways to mislead audiences on social media is the use of unaltered images in a new but false context. We propose a new method that automatically highlights out-of-context image and text pairs, for assisting fact-checkers. Our key insight is to leverage the grounding of images with text to distinguish out-of-context scenarios that cannot be disambiguated with language alone. We propose a self-supervised training strategy where we only need a set of captioned images. At train time, our method learns to selectively align individual objects in an image with textual claims, without explicit supervision. At test time, we check if both captions correspond to the same object(s) in the image but are semantically different, which allows us to make fairly accurate out-of-context predictions. Our method achieves 85% out-of-context detection accuracy. To facilitate benchmarking of this task, we create a large-scale dataset of 200K images with 450K textual captions from a variety of news websites, blogs, and social media posts.
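    The abstract's test-time rule (same grounded object(s), different meaning) can be written down directly. In the sketch below, the region-overlap score and the caption-similarity score are assumed to come from the trained grounding and text models; the function name and the two thresholds are illustrative assumptions.

```python
def is_out_of_context(grounding_iou, caption_similarity,
                      iou_thresh=0.5, sim_thresh=0.5):
    """Flag an image with two captions as out-of-context (sketch of the test-time check).

    grounding_iou:      overlap (IoU) between the image object(s) selected for
                        caption 1 and those selected for caption 2
    caption_similarity: semantic similarity between the two captions, in [0, 1]
    Returns True when both captions ground to the same object(s) but disagree
    semantically. Thresholds are assumptions, not the paper's values.
    """
    same_objects = grounding_iou >= iou_thresh
    different_meaning = caption_similarity < sim_thresh
    return same_objects and different_meaning
```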